Biostat 203B Homework 3

Due Feb 23 @ 11:59PM

Author

Jiyin (Jenny) Zhang, UID: 606331859

Display machine information for reproducibility:

sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.6.4

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.3.2    fastmap_1.1.1     cli_3.6.2        
 [5] tools_4.3.2       htmltools_0.5.7   rstudioapi_0.15.0 yaml_2.3.8       
 [9] rmarkdown_2.25    knitr_1.45        jsonlite_1.8.8    xfun_0.41        
[13] digest_0.6.33     rlang_1.1.2       evaluate_0.23    

Load necessary libraries.

library(arrow)

Attaching package: 'arrow'
The following object is masked from 'package:utils':

    timestamp
library(memuse)
library(pryr)
library(R.utils)
Loading required package: R.oo
Loading required package: R.methodsS3
R.methodsS3 v1.8.2 (2022-06-13 22:00:14 UTC) successfully loaded. See ?R.methodsS3 for help.
R.oo v1.26.0 (2024-01-24 05:12:50 UTC) successfully loaded. See ?R.oo for help.

Attaching package: 'R.oo'
The following object is masked from 'package:R.methodsS3':

    throw
The following objects are masked from 'package:methods':

    getClasses, getMethods
The following objects are masked from 'package:base':

    attach, detach, load, save
R.utils v2.12.3 (2023-11-18 01:00:02 UTC) successfully loaded. See ?R.utils for help.

Attaching package: 'R.utils'
The following object is masked from 'package:arrow':

    timestamp
The following object is masked from 'package:utils':

    timestamp
The following objects are masked from 'package:base':

    cat, commandArgs, getOption, isOpen, nullfile, parse, warnings
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ purrr::compose()      masks pryr::compose()
✖ lubridate::duration() masks arrow::duration()
✖ tidyr::extract()      masks R.utils::extract()
✖ dplyr::filter()       masks stats::filter()
✖ dplyr::lag()          masks stats::lag()
✖ purrr::partial()      masks pryr::partial()
✖ dplyr::where()        masks pryr::where()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)

Display your machine memory.

memuse::Sys.meminfo()
Totalram:  8.000 GiB 
Freeram:   1.672 GiB 

In this exercise, we use tidyverse (ggplot2, dplyr, etc) to explore the MIMIC-IV data introduced in homework 1 and to build a cohort of ICU stays.

Q1. Visualizing patient trajectory

Visualizing a patient’s encounters in a health care system is a common task in clinical data analysis. In this question, we will visualize a patient’s ADT (admission-discharge-transfer) history and ICU vitals in the MIMIC-IV data.

Q1.1 ADT history

A patient’s ADT history records the time of admission, discharge, and transfer in the hospital. This figure shows the ADT history of the patient with subject_id 10001217 in the MIMIC-IV data. The x-axis is the calendar time, and the y-axis is the type of event (ADT, lab, procedure). The color of the line segment represents the care unit. The size of the line segment represents whether the care unit is an ICU/CCU. The crosses represent lab events, and the shape of the dots represents the type of procedure. The title of the figure shows the patient’s demographic information and the subtitle shows top 3 diagnoses.

Do a similar visualization for the patient with subject_id 10013310 using ggplot.

Hint: We need to pull information from data files patients.csv.gz, admissions.csv.gz, transfers.csv.gz, labevents.csv.gz, procedures_icd.csv.gz, diagnoses_icd.csv.gz, d_icd_procedures.csv.gz, and d_icd_diagnoses.csv.gz. For the big file labevents.csv.gz, use the Parquet format you generated in Homework 2. For reproducibility, make the Parquet folder labevents_pq available at the current working directory hw3, for example, by a symbolic link. Make your code reproducible.

Answer

sid_adt <- read_csv("~/mimic/hosp/transfers.csv.gz") |>
  filter(subject_id == 10013310) |>
  filter(eventtype != "discharge") |>
  mutate(
    linewidth = ifelse(
      str_detect(careunit, "(ICU|CCU)"), 3, 1
    )
  ) |>
  print(width = Inf)
Rows: 1890972 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): eventtype, careunit
dbl  (3): subject_id, hadm_id, transfer_id
dttm (2): intime, outtime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 11 × 8
   subject_id  hadm_id transfer_id eventtype
        <dbl>    <dbl>       <dbl> <chr>    
 1   10013310 21243435    31736720 ED       
 2   10013310 21243435    33511674 transfer 
 3   10013310 21243435    34848129 transfer 
 4   10013310 21243435    38910974 admit    
 5   10013310 22098926    31651850 transfer 
 6   10013310 22098926    32769810 admit    
 7   10013310 22098926    33278851 transfer 
 8   10013310 22098926    34063502 ED       
 9   10013310 27682188    30077870 transfer 
10   10013310 27682188    31203589 admit    
11   10013310 27682188    35160955 ED       
   careunit                                        intime             
   <chr>                                           <dttm>             
 1 Emergency Department                            2153-05-26 08:56:00
 2 Medicine/Cardiology                             2153-05-26 16:19:26
 3 Medicine/Cardiology                             2153-05-26 14:42:55
 4 Medicine/Cardiology                             2153-05-26 14:18:39
 5 Neuro Intermediate                              2153-06-12 16:31:33
 6 Neuro Surgical Intensive Care Unit (Neuro SICU) 2153-06-10 11:55:42
 7 Medicine                                        2153-06-16 19:03:14
 8 Emergency Department                            2153-06-10 10:40:00
 9 Medicine/Cardiology                             2153-05-07 20:47:19
10 Coronary Care Unit (CCU)                        2153-05-06 18:28:00
11 Emergency Department                            2153-05-06 10:21:00
   outtime             linewidth
   <dttm>                  <dbl>
 1 2153-05-26 14:18:39         1
 2 2153-06-05 19:58:00         1
 3 2153-05-26 16:19:26         1
 4 2153-05-26 14:42:55         1
 5 2153-06-16 19:03:14         1
 6 2153-06-12 16:31:33         3
 7 2153-07-21 18:02:28         1
 8 2153-06-10 11:55:42         1
 9 2153-05-13 15:36:52         1
10 2153-05-07 20:47:19         3
11 2153-05-06 18:28:00         1
sid_proce <- read_csv("~/mimic/hosp/procedures_icd.csv.gz",
  show_col_types = FALSE
) |>
  filter(subject_id == 10013310) |>
  print(width = Inf)
# A tibble: 9 × 6
  subject_id  hadm_id seq_num chartdate  icd_code icd_version
       <dbl>    <dbl>   <dbl> <date>     <chr>          <dbl>
1   10013310 21243435       1 2153-05-27 4A023N7           10
2   10013310 21243435       2 2153-05-27 B2111ZZ           10
3   10013310 21243435       3 2153-05-27 B241ZZ3           10
4   10013310 22098926       1 2153-06-10 03CG3ZZ           10
5   10013310 22098926       2 2153-06-10 3E05317           10
6   10013310 22098926       3 2153-07-15 0DH63UZ           10
7   10013310 22098926       4 2153-06-11 3E0G76Z           10
8   10013310 27682188       1 2153-05-06 027034Z           10
9   10013310 27682188       2 2153-05-06 B211YZZ           10
sid_lab <- open_dataset("labevents_pq/labevents.parquet", format = "parquet") |>
  filter(subject_id %in% 10013310) |>
  collect() %>%
  print(width = Inf)
# A tibble: 2,285 × 16
   labevent_id subject_id hadm_id specimen_id itemid order_provider_id
         <int>      <int>   <int>       <int>  <int> <chr>            
 1      153564   10013310      NA     4841989  50887 ""               
 2      153565   10013310      NA     8958046  50934 ""               
 3      153566   10013310      NA     8958046  50947 ""               
 4      153567   10013310      NA     8958046  51003 ""               
 5      153568   10013310      NA     8958046  51678 ""               
 6      153569   10013310      NA    10682517  50933 ""               
 7      153570   10013310      NA    11713499  51133 ""               
 8      153571   10013310      NA    11713499  51146 ""               
 9      153572   10013310      NA    11713499  51200 ""               
10      153573   10013310      NA    11713499  51221 ""               
   charttime           storetime          
   <dttm>              <dttm>             
 1 2153-05-06 03:30:00 NA                 
 2 2153-05-06 03:30:00 2153-05-06 04:22:00
 3 2153-05-06 03:30:00 2153-05-06 04:22:00
 4 2153-05-06 03:30:00 2153-05-06 04:41:00
 5 2153-05-06 03:30:00 2153-05-06 04:22:00
 6 2153-05-06 03:30:00 NA                 
 7 2153-05-06 03:30:00 2153-05-06 04:09:00
 8 2153-05-06 03:30:00 2153-05-06 04:09:00
 9 2153-05-06 03:30:00 2153-05-06 04:09:00
10 2153-05-06 03:30:00 2153-05-06 04:09:00
   value                                    valuenum valueuom ref_range_lower
   <chr>                                       <dbl> <chr>              <dbl>
 1 HOLD.  DISCARD GREATER THAN 24 HRS OLD.     NA    ""                  NA  
 2 5                                            5    ""                  NA  
 3 2                                            2    ""                  NA  
 4 ___                                          2.97 "ng/mL"              0  
 5 14                                          14    ""                  NA  
 6 HOLD.  DISCARD GREATER THAN 4 HOURS OLD.    NA    ""                  NA  
 7 1.90                                         1.9  "K/uL"               1.2
 8 0.2                                          0.2  "%"                  0  
 9 0.1                                          0.1  "%"                  1  
10 32.5                                        32.5  "%"                 34  
   ref_range_upper flag       priority comments
             <dbl> <chr>      <chr>    <chr>   
 1           NA    ""         STAT     "___"   
 2           NA    ""         STAT     ""      
 3           NA    ""         STAT     ""      
 4            0.01 "abnormal" STAT     "___"   
 5           NA    ""         STAT     ""      
 6           NA    ""         STAT     "___"   
 7            3.7  ""         STAT     ""      
 8            1    ""         STAT     ""      
 9            7    "abnormal" STAT     ""      
10           45    "abnormal" STAT     ""      
# ℹ 2,275 more rows

Find the patient’s demographic information:

sid_patients <- read_csv("~/mimic/hosp/patients.csv.gz", 
                         show_col_types = FALSE) |>
  filter(subject_id == 10013310) |>
  print(width = Inf)
# A tibble: 1 × 6
  subject_id gender anchor_age anchor_year anchor_year_group dod       
       <dbl> <chr>       <dbl>       <dbl> <chr>             <date>    
1   10013310 F              70        2153 2017 - 2019       2153-11-19

Find the race of the patient:

sid_admission <- read_csv("~/mimic/hosp/admissions.csv.gz", 
                          show_col_types = FALSE) |>
  filter(subject_id == 10013310) |>
  select(race) |>
  print(width = Inf)
# A tibble: 3 × 1
  race         
  <chr>        
1 BLACK/AFRICAN
2 BLACK/AFRICAN
3 BLACK/AFRICAN
sid_proce2 <- read_csv("~/mimic/hosp/d_icd_procedures.csv.gz", 
                       show_col_types = FALSE) |>
  print(width = Inf)
# A tibble: 85,257 × 3
   icd_code icd_version
   <chr>          <dbl>
 1 0001               9
 2 0002               9
 3 0003               9
 4 0009               9
 5 001               10
 6 0010               9
 7 0011               9
 8 0012               9
 9 0013               9
10 0014               9
   long_title                                                 
   <chr>                                                      
 1 Therapeutic ultrasound of vessels of head and neck         
 2 Therapeutic ultrasound of heart                            
 3 Therapeutic ultrasound of peripheral vascular vessels      
 4 Other therapeutic ultrasound                               
 5 Central Nervous System and Cranial Nerves, Bypass          
 6 Implantation of chemotherapeutic agent                     
 7 Infusion of drotrecogin alfa (activated)                   
 8 Administration of inhaled nitric oxide                     
 9 Injection or infusion of nesiritide                        
10 Injection or infusion of oxazolidinone class of antibiotics
# ℹ 85,247 more rows
sid_dia <- read_csv("~/mimic/hosp/diagnoses_icd.csv.gz", 
                    show_col_types = FALSE) |>
  filter(subject_id == 10013310) |>
  print(width = Inf)
# A tibble: 71 × 5
   subject_id  hadm_id seq_num icd_code icd_version
        <dbl>    <dbl>   <dbl> <chr>          <dbl>
 1   10013310 21243435       1 I222              10
 2   10013310 21243435       2 I5023             10
 3   10013310 21243435       3 I428              10
 4   10013310 21243435       4 E1142             10
 5   10013310 21243435       5 E1165             10
 6   10013310 21243435       6 I213              10
 7   10013310 21243435       7 I110              10
 8   10013310 21243435       8 I2510             10
 9   10013310 21243435       9 M25511            10
10   10013310 21243435      10 E785              10
# ℹ 61 more rows
sid_dia2 <- read_csv("~/mimic/hosp/d_icd_diagnoses.csv.gz", 
                     show_col_types = FALSE) |>
  print(width = Inf)
# A tibble: 109,775 × 3
   icd_code icd_version long_title                           
   <chr>          <dbl> <chr>                                
 1 0010               9 Cholera due to vibrio cholerae       
 2 0011               9 Cholera due to vibrio cholerae el tor
 3 0019               9 Cholera, unspecified                 
 4 0020               9 Typhoid fever                        
 5 0021               9 Paratyphoid fever A                  
 6 0022               9 Paratyphoid fever B                  
 7 0023               9 Paratyphoid fever C                  
 8 0029               9 Paratyphoid fever, unspecified       
 9 0030               9 Salmonella gastroenteritis           
10 0031               9 Salmonella septicemia                
# ℹ 109,765 more rows

Merge the dataset and find the top 3 diagnoses:

sid_dia_all <- left_join(sid_dia, sid_dia2, by = "icd_code", "icd_version")
count(sid_dia_all, long_title, sort = T)
# A tibble: 54 × 2
   long_title                                                                  n
   <chr>                                                                   <int>
 1 Acute on chronic systolic (congestive) heart failure                        3
 2 Hyperlipidemia, unspecified                                                 3
 3 Long term (current) use of insulin                                          3
 4 Other chronic pain                                                          3
 5 Atherosclerotic heart disease of native coronary artery without angina…     2
 6 Cardiomyopathy, unspecified                                                 2
 7 Hyperkalemia                                                                2
 8 Hypertensive heart disease with heart failure                               2
 9 Hypotension, unspecified                                                    2
10 Hypovolemia                                                                 2
# ℹ 44 more rows
sid_proce_all <- left_join(sid_proce, sid_proce2, 
                           by = "icd_code", "icd_version")
sid_proce_all$chartdate <- as.POSIXct(sid_proce$chartdate)

Create the plot:

ggplot() +
  geom_point(
    data = sid_proce_all,
    aes(x = chartdate, y = "Procedure", shape = long_title)
  ) +
  geom_point(
    data = sid_lab,
    aes(x = charttime, y = "Lab"), shape = 3
  ) +
  geom_segment(
    data = sid_adt,
    aes(x = intime, xend = outtime, y = "ADT", yend = "ADT", color = careunit),
    linewidth = sid_adt$linewidth
  ) +
  scale_y_discrete(limits = c("Procedure", "Lab", "ADT")) +
  labs(
    x = "Calendar Time",
    y = "",
    color = "Care Unit",
    shape = "Procedure",
    title = paste(
      "Patient ", "10013310, ", "F, ",
      "70 ", "years old, ", "Black/African"
    ),
    subtitle = "Acute on chronic systolic (congestive) heart failure \nHyperlipidemia, unspecified \nLong term (current) use of insulin \nOther chronic pain"
  ) +
  theme(
    legend.position = "bottom",
    legend.box = "vertical",
    legend.title = element_text(size = 9)
  ) +
  guides(
    shape = guide_legend(nrow = 9, byrow = TRUE)
  ) +
  scale_shape_manual(values = 1:9)

ggsave("10013310_adt_lab_proc.png", width = 10, height = 8)

Q1.2 ICU stays

ICU stays are a subset of ADT history. This figure shows the vitals of the patient 10001217 during ICU stays. The x-axis is the calendar time, and the y-axis is the value of the vital. The color of the line represents the type of vital. The facet grid shows the abbreviation of the vital and the stay ID.

Do a similar visualization for the patient 10013310.

Answer:

The patient’s subject_id is 10013310.

sid <- 10013310
sid_icu <- read_csv("~/mimic/icu/icustays.csv.gz", show_col_types = FALSE) |>
  filter(subject_id == sid) |>
  print(width = Inf)
# A tibble: 2 × 8
  subject_id  hadm_id  stay_id first_careunit                                 
       <dbl>    <dbl>    <dbl> <chr>                                          
1   10013310 22098926 32769810 Neuro Surgical Intensive Care Unit (Neuro SICU)
2   10013310 27682188 31203589 Coronary Care Unit (CCU)                       
  last_careunit            intime              outtime               los
  <chr>                    <dttm>              <dttm>              <dbl>
1 Neuro Intermediate       2153-06-10 11:55:42 2153-06-16 19:03:14  6.30
2 Coronary Care Unit (CCU) 2153-05-06 18:28:00 2153-05-07 20:47:19  1.10

Find the corresponding itemid for the vitals:

label_vitals <- read_csv("~/mimic/icu/d_items.csv.gz", 
                         show_col_types = FALSE) |>
  filter(abbreviation %in% c("HR", "NBPd", "NBPs", "RR", "Temperature F")) |>
  select(itemid, abbreviation, label) |>
  print(width = Inf)
# A tibble: 5 × 3
  itemid abbreviation  label                                
   <dbl> <chr>         <chr>                                
1 220045 HR            Heart Rate                           
2 220179 NBPs          Non Invasive Blood Pressure systolic 
3 220180 NBPd          Non Invasive Blood Pressure diastolic
4 220210 RR            Respiratory Rate                     
5 223761 Temperature F Temperature Fahrenheit               
value_vitals <- open_dataset("chartevents_pq/chartevents.parquet") |>
  filter(subject_id == sid) |>
  filter(itemid %in% c(220045, 220180, 220179, 223761, 220210)) |>
  select(subject_id, itemid, valuenum, stay_id, charttime) |>
  collect() |>
  print(width = Inf)
# A tibble: 549 × 5
   subject_id itemid valuenum  stay_id charttime          
        <int>  <int>    <dbl>    <int> <dttm>             
 1   10013310 223761     98.8 32769810 2153-06-11 01:00:00
 2   10013310 220045    113   32769810 2153-06-11 02:00:00
 3   10013310 220210     26   32769810 2153-06-11 02:00:00
 4   10013310 220179    131   32769810 2153-06-11 02:02:00
 5   10013310 220180     62   32769810 2153-06-11 02:02:00
 6   10013310 220045    121   32769810 2153-06-12 00:00:00
 7   10013310 220210     25   32769810 2153-06-12 00:00:00
 8   10013310 220179    134   32769810 2153-06-12 00:03:00
 9   10013310 220180     70   32769810 2153-06-12 00:03:00
10   10013310 223761     99   32769810 2153-06-12 01:00:00
# ℹ 539 more rows
sid_vitals <- left_join(value_vitals, label_vitals, by = "itemid") |>
  print(width = Inf)
# A tibble: 549 × 7
   subject_id itemid valuenum  stay_id charttime           abbreviation 
        <int>  <dbl>    <dbl>    <int> <dttm>              <chr>        
 1   10013310 223761     98.8 32769810 2153-06-11 01:00:00 Temperature F
 2   10013310 220045    113   32769810 2153-06-11 02:00:00 HR           
 3   10013310 220210     26   32769810 2153-06-11 02:00:00 RR           
 4   10013310 220179    131   32769810 2153-06-11 02:02:00 NBPs         
 5   10013310 220180     62   32769810 2153-06-11 02:02:00 NBPd         
 6   10013310 220045    121   32769810 2153-06-12 00:00:00 HR           
 7   10013310 220210     25   32769810 2153-06-12 00:00:00 RR           
 8   10013310 220179    134   32769810 2153-06-12 00:03:00 NBPs         
 9   10013310 220180     70   32769810 2153-06-12 00:03:00 NBPd         
10   10013310 223761     99   32769810 2153-06-12 01:00:00 Temperature F
   label                                
   <chr>                                
 1 Temperature Fahrenheit               
 2 Heart Rate                           
 3 Respiratory Rate                     
 4 Non Invasive Blood Pressure systolic 
 5 Non Invasive Blood Pressure diastolic
 6 Heart Rate                           
 7 Respiratory Rate                     
 8 Non Invasive Blood Pressure systolic 
 9 Non Invasive Blood Pressure diastolic
10 Temperature Fahrenheit               
# ℹ 539 more rows

Merge the vitals with time:

time_vitals <- left_join(sid_icu, sid_vitals, by = "stay_id") |>
  select(valuenum, abbreviation, stay_id, charttime) |>
  print(width = Inf)
# A tibble: 549 × 4
   valuenum abbreviation   stay_id charttime          
      <dbl> <chr>            <dbl> <dttm>             
 1     98.8 Temperature F 32769810 2153-06-11 01:00:00
 2    113   HR            32769810 2153-06-11 02:00:00
 3     26   RR            32769810 2153-06-11 02:00:00
 4    131   NBPs          32769810 2153-06-11 02:02:00
 5     62   NBPd          32769810 2153-06-11 02:02:00
 6    121   HR            32769810 2153-06-12 00:00:00
 7     25   RR            32769810 2153-06-12 00:00:00
 8    134   NBPs          32769810 2153-06-12 00:03:00
 9     70   NBPd          32769810 2153-06-12 00:03:00
10     99   Temperature F 32769810 2153-06-12 01:00:00
# ℹ 539 more rows
time_vitals %>%
  # mutate(charttime = charttime + lubridate::hours(6)) |>
  ggplot(
    aes(
      x = charttime,
      y = valuenum,
      color = abbreviation
    ),
    na.rm = TRUE
  ) +
  geom_line(
    linewidth = 0.1
  ) +
  geom_point(
    size = 0.8
  ) +
  geom_line() +
  facet_grid(
    abbreviation ~ stay_id,
    scales = "free",
    space = "free_x"
  ) +
  labs(
    x = "",
    y = "",
    title = str_c("Patient ", sid, " ICU stays - Vitals"),
  ) +
  theme_light() +
  scale_x_datetime(
    breaks = scales::date_breaks("12 hours"),
    labels = scales::date_format("%b %d %H:%M"),
  ) +
  theme(
    legend.position = "none",
    strip.text = element_text(size = 6),
    # panel.spacing = unit(0.5, "lines"),
    axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)
  )

Q2. ICU stays

icustays.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/icustays/) contains data about Intensive Care Units (ICU) stays. The first 10 lines are

zcat < ~/mimic/icu/icustays.csv.gz | head
subject_id,hadm_id,stay_id,first_careunit,last_careunit,intime,outtime,los
10000032,29079034,39553978,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2180-07-23 14:00:00,2180-07-23 23:50:47,0.4102662037037037
10000980,26913865,39765666,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2189-06-27 08:42:00,2189-06-27 20:38:27,0.4975347222222222
10001217,24597018,37067082,Surgical Intensive Care Unit (SICU),Surgical Intensive Care Unit (SICU),2157-11-20 19:18:02,2157-11-21 22:08:00,1.1180324074074075
10001217,27703517,34592300,Surgical Intensive Care Unit (SICU),Surgical Intensive Care Unit (SICU),2157-12-19 15:42:24,2157-12-20 14:27:41,0.9481134259259258
10001725,25563031,31205490,Medical/Surgical Intensive Care Unit (MICU/SICU),Medical/Surgical Intensive Care Unit (MICU/SICU),2110-04-11 15:52:22,2110-04-12 23:59:56,1.338587962962963
10001884,26184834,37510196,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2131-01-11 04:20:05,2131-01-20 08:27:30,9.171817129629629
10002013,23581541,39060235,Cardiac Vascular Intensive Care Unit (CVICU),Cardiac Vascular Intensive Care Unit (CVICU),2160-05-18 10:00:53,2160-05-19 17:33:33,1.3143518518518518
10002155,20345487,32358465,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2131-03-09 21:33:00,2131-03-10 18:09:21,0.8585763888888889
10002155,23822395,33685454,Coronary Care Unit (CCU),Coronary Care Unit (CCU),2129-08-04 12:45:00,2129-08-10 17:02:38,6.178912037037037

Q2.1 Ingestion

Import icustays.csv.gz as a tibble icustays_tble.

Answer:

icustays_tble <- as_tibble(read_csv("~/mimic/icu/icustays.csv.gz",
  show_col_types = FALSE
)) |>
  mutate(subject_id = as.double(subject_id)) |>
  mutate(stay_id = as.double(stay_id)) |>
  arrange(subject_id, stay_id) |>
  print(width = Inf)
# A tibble: 73,181 × 8
   subject_id  hadm_id  stay_id first_careunit                                  
        <dbl>    <dbl>    <dbl> <chr>                                           
 1   10000032 29079034 39553978 Medical Intensive Care Unit (MICU)              
 2   10000980 26913865 39765666 Medical Intensive Care Unit (MICU)              
 3   10001217 27703517 34592300 Surgical Intensive Care Unit (SICU)             
 4   10001217 24597018 37067082 Surgical Intensive Care Unit (SICU)             
 5   10001725 25563031 31205490 Medical/Surgical Intensive Care Unit (MICU/SICU)
 6   10001884 26184834 37510196 Medical Intensive Care Unit (MICU)              
 7   10002013 23581541 39060235 Cardiac Vascular Intensive Care Unit (CVICU)    
 8   10002155 28994087 31090461 Medical/Surgical Intensive Care Unit (MICU/SICU)
 9   10002155 20345487 32358465 Medical Intensive Care Unit (MICU)              
10   10002155 23822395 33685454 Coronary Care Unit (CCU)                        
   last_careunit                                    intime             
   <chr>                                            <dttm>             
 1 Medical Intensive Care Unit (MICU)               2180-07-23 14:00:00
 2 Medical Intensive Care Unit (MICU)               2189-06-27 08:42:00
 3 Surgical Intensive Care Unit (SICU)              2157-12-19 15:42:24
 4 Surgical Intensive Care Unit (SICU)              2157-11-20 19:18:02
 5 Medical/Surgical Intensive Care Unit (MICU/SICU) 2110-04-11 15:52:22
 6 Medical Intensive Care Unit (MICU)               2131-01-11 04:20:05
 7 Cardiac Vascular Intensive Care Unit (CVICU)     2160-05-18 10:00:53
 8 Medical/Surgical Intensive Care Unit (MICU/SICU) 2130-09-24 00:50:00
 9 Medical Intensive Care Unit (MICU)               2131-03-09 21:33:00
10 Coronary Care Unit (CCU)                         2129-08-04 12:45:00
   outtime               los
   <dttm>              <dbl>
 1 2180-07-23 23:50:47 0.410
 2 2189-06-27 20:38:27 0.498
 3 2157-12-20 14:27:41 0.948
 4 2157-11-21 22:08:00 1.12 
 5 2110-04-12 23:59:56 1.34 
 6 2131-01-20 08:27:30 9.17 
 7 2160-05-19 17:33:33 1.31 
 8 2130-09-27 22:13:41 3.89 
 9 2131-03-10 18:09:21 0.859
10 2129-08-10 17:02:38 6.18 
# ℹ 73,171 more rows

Q2.2 Summary and visualization

How many unique subject_id? Can a subject_id have multiple ICU stays? Summarize the number of ICU stays per subject_id by graphs.

Answer:

cat("Number of unique subject_id: ", unique(icustays_tble$subject_id) |>
  length(), "\n")
Number of unique subject_id:  50920 
cat("Number of ICU stays: ", nrow(icustays_tble), "\n")
Number of ICU stays:  73181 

The number of ICU statys is greater than the number of unique subject_id. Therefore, a subject_id can have multiple ICU stays.

Here is the graph of the number of ICU stays per subject_id:

icustays_tble %>%
  count(subject_id) %>%
  ggplot(aes(x = n)) +
  geom_bar() +
  labs(
    x = "Number of ICU stays",
    y = "Number of patients",
    title = "Number of ICU stays per patient"
  )

Therefore, most patients have only one ICU stay, and a few patients have more than one ICU stay. It seems that with the increase of ICU stays, the number of patients decreases.

Q3. admissions data

Information of the patients admitted into hospital is available in admissions.csv.gz. See https://mimic.mit.edu/docs/iv/modules/hosp/admissions/ for details of each field in this file. The first 10 lines are

zcat < ~/mimic/hosp/admissions.csv.gz | head
subject_id,hadm_id,admittime,dischtime,deathtime,admission_type,admit_provider_id,admission_location,discharge_location,insurance,language,marital_status,race,edregtime,edouttime,hospital_expire_flag
10000032,22595853,2180-05-06 22:23:00,2180-05-07 17:15:00,,URGENT,P874LG,TRANSFER FROM HOSPITAL,HOME,Other,ENGLISH,WIDOWED,WHITE,2180-05-06 19:17:00,2180-05-06 23:30:00,0
10000032,22841357,2180-06-26 18:27:00,2180-06-27 18:49:00,,EW EMER.,P09Q6Y,EMERGENCY ROOM,HOME,Medicaid,ENGLISH,WIDOWED,WHITE,2180-06-26 15:54:00,2180-06-26 21:31:00,0
10000032,25742920,2180-08-05 23:44:00,2180-08-07 17:50:00,,EW EMER.,P60CC5,EMERGENCY ROOM,HOSPICE,Medicaid,ENGLISH,WIDOWED,WHITE,2180-08-05 20:58:00,2180-08-06 01:44:00,0
10000032,29079034,2180-07-23 12:35:00,2180-07-25 17:55:00,,EW EMER.,P30KEH,EMERGENCY ROOM,HOME,Medicaid,ENGLISH,WIDOWED,WHITE,2180-07-23 05:54:00,2180-07-23 14:00:00,0
10000068,25022803,2160-03-03 23:16:00,2160-03-04 06:26:00,,EU OBSERVATION,P51VDL,EMERGENCY ROOM,,Other,ENGLISH,SINGLE,WHITE,2160-03-03 21:55:00,2160-03-04 06:26:00,0
10000084,23052089,2160-11-21 01:56:00,2160-11-25 14:52:00,,EW EMER.,P6957U,WALK-IN/SELF REFERRAL,HOME HEALTH CARE,Medicare,ENGLISH,MARRIED,WHITE,2160-11-20 20:36:00,2160-11-21 03:20:00,0
10000084,29888819,2160-12-28 05:11:00,2160-12-28 16:07:00,,EU OBSERVATION,P63AD6,PHYSICIAN REFERRAL,,Medicare,ENGLISH,MARRIED,WHITE,2160-12-27 18:32:00,2160-12-28 16:07:00,0
10000108,27250926,2163-09-27 23:17:00,2163-09-28 09:04:00,,EU OBSERVATION,P38XXV,EMERGENCY ROOM,,Other,ENGLISH,SINGLE,WHITE,2163-09-27 16:18:00,2163-09-28 09:04:00,0
10000117,22927623,2181-11-15 02:05:00,2181-11-15 14:52:00,,EU OBSERVATION,P2358X,EMERGENCY ROOM,,Other,ENGLISH,DIVORCED,WHITE,2181-11-14 21:51:00,2181-11-15 09:57:00,0

Q3.1 Ingestion

Import admissions.csv.gz as a tibble admissions_tble.

Answer:

admissions_tble <- as_tibble(read_csv("~/mimic/hosp/admissions.csv.gz",
  show_col_types = FALSE
)) |>
  print(width = Inf)
# A tibble: 431,231 × 16
   subject_id  hadm_id admittime           dischtime           deathtime
        <dbl>    <dbl> <dttm>              <dttm>              <dttm>   
 1   10000032 22595853 2180-05-06 22:23:00 2180-05-07 17:15:00 NA       
 2   10000032 22841357 2180-06-26 18:27:00 2180-06-27 18:49:00 NA       
 3   10000032 25742920 2180-08-05 23:44:00 2180-08-07 17:50:00 NA       
 4   10000032 29079034 2180-07-23 12:35:00 2180-07-25 17:55:00 NA       
 5   10000068 25022803 2160-03-03 23:16:00 2160-03-04 06:26:00 NA       
 6   10000084 23052089 2160-11-21 01:56:00 2160-11-25 14:52:00 NA       
 7   10000084 29888819 2160-12-28 05:11:00 2160-12-28 16:07:00 NA       
 8   10000108 27250926 2163-09-27 23:17:00 2163-09-28 09:04:00 NA       
 9   10000117 22927623 2181-11-15 02:05:00 2181-11-15 14:52:00 NA       
10   10000117 27988844 2183-09-18 18:10:00 2183-09-21 16:30:00 NA       
   admission_type    admit_provider_id admission_location     discharge_location
   <chr>             <chr>             <chr>                  <chr>             
 1 URGENT            P874LG            TRANSFER FROM HOSPITAL HOME              
 2 EW EMER.          P09Q6Y            EMERGENCY ROOM         HOME              
 3 EW EMER.          P60CC5            EMERGENCY ROOM         HOSPICE           
 4 EW EMER.          P30KEH            EMERGENCY ROOM         HOME              
 5 EU OBSERVATION    P51VDL            EMERGENCY ROOM         <NA>              
 6 EW EMER.          P6957U            WALK-IN/SELF REFERRAL  HOME HEALTH CARE  
 7 EU OBSERVATION    P63AD6            PHYSICIAN REFERRAL     <NA>              
 8 EU OBSERVATION    P38XXV            EMERGENCY ROOM         <NA>              
 9 EU OBSERVATION    P2358X            EMERGENCY ROOM         <NA>              
10 OBSERVATION ADMIT P75S70            WALK-IN/SELF REFERRAL  HOME HEALTH CARE  
   insurance language marital_status race  edregtime          
   <chr>     <chr>    <chr>          <chr> <dttm>             
 1 Other     ENGLISH  WIDOWED        WHITE 2180-05-06 19:17:00
 2 Medicaid  ENGLISH  WIDOWED        WHITE 2180-06-26 15:54:00
 3 Medicaid  ENGLISH  WIDOWED        WHITE 2180-08-05 20:58:00
 4 Medicaid  ENGLISH  WIDOWED        WHITE 2180-07-23 05:54:00
 5 Other     ENGLISH  SINGLE         WHITE 2160-03-03 21:55:00
 6 Medicare  ENGLISH  MARRIED        WHITE 2160-11-20 20:36:00
 7 Medicare  ENGLISH  MARRIED        WHITE 2160-12-27 18:32:00
 8 Other     ENGLISH  SINGLE         WHITE 2163-09-27 16:18:00
 9 Other     ENGLISH  DIVORCED       WHITE 2181-11-14 21:51:00
10 Other     ENGLISH  DIVORCED       WHITE 2183-09-18 08:41:00
   edouttime           hospital_expire_flag
   <dttm>                             <dbl>
 1 2180-05-06 23:30:00                    0
 2 2180-06-26 21:31:00                    0
 3 2180-08-06 01:44:00                    0
 4 2180-07-23 14:00:00                    0
 5 2160-03-04 06:26:00                    0
 6 2160-11-21 03:20:00                    0
 7 2160-12-28 16:07:00                    0
 8 2163-09-28 09:04:00                    0
 9 2181-11-15 09:57:00                    0
10 2183-09-18 20:20:00                    0
# ℹ 431,221 more rows

Q3.2 Summary and visualization

Summarize the following information by graphics and explain any patterns you see.

  • number of admissions per patient
  • admission hour (anything unusual?)
  • admission minute (anything unusual?)
  • length of hospital stay (from admission to discharge) (anything unusual?)

According to the MIMIC-IV documentation,

All dates in the database have been shifted to protect patient confidentiality. Dates will be internally consistent for the same patient, but randomly distributed in the future. Dates of birth which occur in the present time are not true dates of birth. Furthermore, dates of birth which occur before the year 1900 occur if the patient is older than 89. In these cases, the patient’s age at their first admission has been fixed to 300.

Answer:

(a) number of admissions per patient:

admissions_tble %>%
  count(subject_id) %>%
  ggplot(aes(x = n)) +
  geom_bar() +
  labs(
    x = "Number of admissions",
    y = "Number of patients",
    title = "Number of admissions per patient"
  )

Based on the plot above, most patients have only one admission, and a few patients have more than one admission. It seems that with the increase of admissions, the number of patients decreases.

(b) admission hour:

admissions_tble %>%
  ggplot(aes(x = hour(admittime))) +
  geom_bar() +
  labs(
    x = "Admission hour",
    y = "Number of admissions",
    title = "Admission hour"
  )

Based on the plot above, the number of admissions is relatively high at 0:00,
and the number of admissions is relatively low in the morning except 7am.

(c) admission minute:

admissions_tble %>%
  ggplot(aes(x = minute(admittime))) +
  geom_bar() +
  labs(
    x = "Admission minute",
    y = "Number of admissions",
    title = "Admission minute"
  )

Based on the plot, admissions peak at 0, 15, 30, and 45 minutes, with 0 minutes having the highest count and decreasing thereafter. Admissions remain consistently low at all other minutes.

(d) length of hospital stay:

admissions_tble %>%
  mutate(los = dischtime - admittime) %>%
  ggplot(aes(x = los)) +
  geom_histogram(bins = 30) +
  labs(
    x = "Length of stay",
    y = "Number of admissions",
    title = "Length of stay"
  )
Don't know how to automatically pick scale for object of type <difftime>.
Defaulting to continuous.

Based on the plot above, the number of admissions is relatively high when the length of stay is small, with the number of admissions decreasing as the length of stay increases. The number of admissions is relatively low when the length of stay is large.

Q4. patients data

Patient information is available in patients.csv.gz. See https://mimic.mit.edu/docs/iv/modules/hosp/patients/ for details of each field in this file. The first 10 lines are

zcat < ~/mimic/hosp/patients.csv.gz | head
subject_id,gender,anchor_age,anchor_year,anchor_year_group,dod
10000032,F,52,2180,2014 - 2016,2180-09-09
10000048,F,23,2126,2008 - 2010,
10000068,F,19,2160,2008 - 2010,
10000084,M,72,2160,2017 - 2019,2161-02-13
10000102,F,27,2136,2008 - 2010,
10000108,M,25,2163,2014 - 2016,
10000115,M,24,2154,2017 - 2019,
10000117,F,48,2174,2008 - 2010,
10000178,F,59,2157,2017 - 2019,

Q4.1 Ingestion

Import patients.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/patients/) as a tibble patients_tble.

Answer:

patients_tble <- as_tibble(read_csv("~/mimic/hosp/patients.csv.gz",
  show_col_types = FALSE
)) |>
  print(width = Inf)
# A tibble: 299,712 × 6
   subject_id gender anchor_age anchor_year anchor_year_group dod       
        <dbl> <chr>       <dbl>       <dbl> <chr>             <date>    
 1   10000032 F              52        2180 2014 - 2016       2180-09-09
 2   10000048 F              23        2126 2008 - 2010       NA        
 3   10000068 F              19        2160 2008 - 2010       NA        
 4   10000084 M              72        2160 2017 - 2019       2161-02-13
 5   10000102 F              27        2136 2008 - 2010       NA        
 6   10000108 M              25        2163 2014 - 2016       NA        
 7   10000115 M              24        2154 2017 - 2019       NA        
 8   10000117 F              48        2174 2008 - 2010       NA        
 9   10000178 F              59        2157 2017 - 2019       NA        
10   10000248 M              34        2192 2014 - 2016       NA        
# ℹ 299,702 more rows

Q4.2 Summary and visualization

Summarize variables gender and anchor_age by graphics, and explain any patterns you see.

Answer:

patients_tble %>%
  ggplot(aes(x = anchor_age, fill = gender)) +
  geom_histogram(binwidth = 5, position = "dodge", alpha = 0.7) +
  labs(
    x = "Anchor Age",
    y = "Count",
    title = "Distribution of Anchor Age by Gender"
  ) +
  scale_fill_manual(values = c("blue", "pink")) +
  theme_minimal()

Based on the histogram, the distribution of anchor age is similar. When the anchor age is small, the number of female patients is slightly higher than the males. When the anchor age is large, the number of female patients is also slightly higher than the number of males. When the anchor age is in the middle, the number of female patients is slightly lower than the number of the males. Overall, the age distribution peaks at approximately 25 years, declines until around 35 years, rises again until about 55 years, and then decreases with further age increases.

Q5. Lab results

labevents.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/labevents/) contains all laboratory measurements for patients. The first 10 lines are

zcat < ~/mimic/hosp/labevents.csv.gz | head
labevent_id,subject_id,hadm_id,specimen_id,itemid,order_provider_id,charttime,storetime,value,valuenum,valueuom,ref_range_lower,ref_range_upper,flag,priority,comments
1,10000032,,45421181,51237,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,1.4,1.4,,0.9,1.1,abnormal,ROUTINE,
2,10000032,,45421181,51274,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,___,15.1,sec,9.4,12.5,abnormal,ROUTINE,VERIFIED.
3,10000032,,52958335,50853,P28Z0X,2180-03-23 11:51:00,2180-03-25 11:06:00,___,15,ng/mL,30,60,abnormal,ROUTINE,NEW ASSAY IN USE ___: DETECTS D2 AND D3 25-OH ACCURATELY.
4,10000032,,52958335,50861,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,102,102,IU/L,0,40,abnormal,ROUTINE,
5,10000032,,52958335,50862,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,3.3,3.3,g/dL,3.5,5.2,abnormal,ROUTINE,
6,10000032,,52958335,50863,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,109,109,IU/L,35,105,abnormal,ROUTINE,
7,10000032,,52958335,50864,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,___,8,ng/mL,0,8.7,,ROUTINE,MEASURED BY ___.
8,10000032,,52958335,50868,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,12,12,mEq/L,8,20,,ROUTINE,
9,10000032,,52958335,50878,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,143,143,IU/L,0,40,abnormal,ROUTINE,

d_labitems.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/d_labitems/) is the dictionary of lab measurements.

zcat < ~/mimic/hosp/d_labitems.csv.gz | head
itemid,label,fluid,category
50801,Alveolar-arterial Gradient,Blood,Blood Gas
50802,Base Excess,Blood,Blood Gas
50803,"Calculated Bicarbonate, Whole Blood",Blood,Blood Gas
50804,Calculated Total CO2,Blood,Blood Gas
50805,Carboxyhemoglobin,Blood,Blood Gas
50806,"Chloride, Whole Blood",Blood,Blood Gas
50808,Free Calcium,Blood,Blood Gas
50809,Glucose,Blood,Blood Gas
50810,"Hematocrit, Calculated",Blood,Blood Gas

We are interested in the lab measurements of creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), and glucose (50931). Retrieve a subset of labevents.csv.gz that only containing these items for the patients in icustays_tble. Further restrict to the last available measurement (by storetime) before the ICU stay. The final labevents_tble should have one row per ICU stay and columns for each lab measurement.

Hint: Use the Parquet format you generated in Homework 2. For reproducibility, make labevents_pq folder available at the current working directory hw3, for example, by a symbolic link.

Answer:

labevents_parquet_dataset <- open_dataset("labevents_pq/labevents.parquet") |>
  print(width = Inf)
FileSystemDataset with 1 Parquet file
labevent_id: int64
subject_id: int64
hadm_id: int64
specimen_id: int64
itemid: int64
order_provider_id: string
charttime: timestamp[ms]
storetime: timestamp[ms]
value: string
valuenum: double
valueuom: string
ref_range_lower: double
ref_range_upper: double
flag: string
priority: string
comments: string
labevents_filtered_parquet <- labevents_parquet_dataset %>%
  mutate(subject_id = as.double(subject_id)) %>%
  mutate(itemid = as.double(itemid)) %>%
  select(subject_id, itemid, storetime, valuenum) %>%
  filter(itemid %in%
    c(50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931)) %>%
  arrange(subject_id, storetime) %>%
  collect()
labevents_filtered_parquet <- as_tibble(labevents_filtered_parquet) |>
  print(width = Inf)
# A tibble: 24,855,909 × 4
   subject_id itemid storetime           valuenum
        <dbl>  <dbl> <dttm>                 <dbl>
 1   10000032  51221 2180-03-23 08:19:00     45.4
 2   10000032  51301 2180-03-23 08:19:00      3  
 3   10000032  50931 2180-03-23 08:56:00     95  
 4   10000032  50882 2180-03-23 09:40:00     27  
 5   10000032  50902 2180-03-23 09:40:00    101  
 6   10000032  50912 2180-03-23 09:40:00      0.4
 7   10000032  50971 2180-03-23 09:40:00      3.7
 8   10000032  50983 2180-03-23 09:40:00    136  
 9   10000032  51221 2180-05-06 15:42:00     42.6
10   10000032  51301 2180-05-06 15:42:00      5  
# ℹ 24,855,899 more rows

Merge the stay_id with the dataset above:

labevents_merge <- inner_join(
  labevents_filtered_parquet,
  icustays_tble,
  by = "subject_id"
) |>
  filter(storetime < intime) |>
  group_by(subject_id, stay_id, itemid) %>%
  arrange(subject_id, stay_id, storetime) %>%
  slice_tail(n = 1) |>
  ungroup()
Warning in inner_join(labevents_filtered_parquet, icustays_tble, by = "subject_id"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1905 of `x` matches multiple rows in `y`.
ℹ Row 1 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
labevents_d_item <- read_csv("~/mimic/hosp/d_labitems.csv.gz",
  show_col_types = FALSE
) |>
  filter(itemid %in%
    c(50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931)) |>
  select(itemid, label) |>
  print(width = Inf)
# A tibble: 8 × 2
  itemid label            
   <dbl> <chr>            
1  50882 Bicarbonate      
2  50902 Chloride         
3  50912 Creatinine       
4  50931 Glucose          
5  50971 Potassium        
6  50983 Sodium           
7  51221 Hematocrit       
8  51301 White Blood Cells
labevents_d <- left_join(
  labevents_merge,
  labevents_d_item,
  by = "itemid"
) |>
  collect() |>
  print(width = Inf)
# A tibble: 525,174 × 12
   subject_id itemid storetime           valuenum  hadm_id  stay_id
        <dbl>  <dbl> <dttm>                 <dbl>    <dbl>    <dbl>
 1   10000032  50882 2180-07-23 00:59:00     25   29079034 39553978
 2   10000032  50902 2180-07-23 00:59:00     95   29079034 39553978
 3   10000032  50912 2180-07-23 00:59:00      0.7 29079034 39553978
 4   10000032  50931 2180-07-23 00:59:00    102   29079034 39553978
 5   10000032  50971 2180-07-23 01:14:00      6.7 29079034 39553978
 6   10000032  50983 2180-07-23 00:59:00    126   29079034 39553978
 7   10000032  51221 2180-07-23 00:00:00     41.1 29079034 39553978
 8   10000032  51301 2180-07-23 00:00:00      6.9 29079034 39553978
 9   10000980  50882 2189-06-27 00:56:00     21   26913865 39765666
10   10000980  50902 2189-06-27 00:56:00    109   26913865 39765666
   first_careunit                     last_careunit                     
   <chr>                              <chr>                             
 1 Medical Intensive Care Unit (MICU) Medical Intensive Care Unit (MICU)
 2 Medical Intensive Care Unit (MICU) Medical Intensive Care Unit (MICU)
 3 Medical Intensive Care Unit (MICU) Medical Intensive Care Unit (MICU)
 4 Medical Intensive Care Unit (MICU) Medical Intensive Care Unit (MICU)
 5 Medical Intensive Care Unit (MICU) Medical Intensive Care Unit (MICU)
 6 Medical Intensive Care Unit (MICU) Medical Intensive Care Unit (MICU)
 7 Medical Intensive Care Unit (MICU) Medical Intensive Care Unit (MICU)
 8 Medical Intensive Care Unit (MICU) Medical Intensive Care Unit (MICU)
 9 Medical Intensive Care Unit (MICU) Medical Intensive Care Unit (MICU)
10 Medical Intensive Care Unit (MICU) Medical Intensive Care Unit (MICU)
   intime              outtime               los label            
   <dttm>              <dttm>              <dbl> <chr>            
 1 2180-07-23 14:00:00 2180-07-23 23:50:47 0.410 Bicarbonate      
 2 2180-07-23 14:00:00 2180-07-23 23:50:47 0.410 Chloride         
 3 2180-07-23 14:00:00 2180-07-23 23:50:47 0.410 Creatinine       
 4 2180-07-23 14:00:00 2180-07-23 23:50:47 0.410 Glucose          
 5 2180-07-23 14:00:00 2180-07-23 23:50:47 0.410 Potassium        
 6 2180-07-23 14:00:00 2180-07-23 23:50:47 0.410 Sodium           
 7 2180-07-23 14:00:00 2180-07-23 23:50:47 0.410 Hematocrit       
 8 2180-07-23 14:00:00 2180-07-23 23:50:47 0.410 White Blood Cells
 9 2189-06-27 08:42:00 2189-06-27 20:38:27 0.498 Bicarbonate      
10 2189-06-27 08:42:00 2189-06-27 20:38:27 0.498 Chloride         
# ℹ 525,164 more rows
labevents_tble <- labevents_d %>%
  pivot_wider(
    id_cols = c(subject_id, stay_id),
    names_from = label,
    values_from = valuenum,
    values_fn = list(valuenum = ~ .[1])
  ) %>%
  # rename_with(~str_to_lower(.), everything())
  rename(
    "wbc" = "White Blood Cells",
    "bicarbonate" = "Bicarbonate",
    "chloride" = "Chloride",
    "creatinine" = "Creatinine",
    "glucose" = "Glucose",
    "potassium" = "Potassium",
    "sodium" = "Sodium",
    "hematocrit" = "Hematocrit"
  )
labevents_tble
# A tibble: 68,467 × 10
   subject_id  stay_id bicarbonate chloride creatinine glucose potassium sodium
        <dbl>    <dbl>       <dbl>    <dbl>      <dbl>   <dbl>     <dbl>  <dbl>
 1   10000032 39553978          25       95        0.7     102       6.7    126
 2   10000980 39765666          21      109        2.3      89       3.9    144
 3   10001217 34592300          30      104        0.5      87       4.1    142
 4   10001217 37067082          22      108        0.6     112       4.2    142
 5   10001725 31205490          NA       98       NA        NA       4.1    139
 6   10001884 37510196          30       88        1.1     141       4.5    130
 7   10002013 39060235          24      102        0.9     288       3.5    137
 8   10002155 31090461          23       98        2.8     117       4.9    135
 9   10002155 32358465          26       85        1.4     133       5.7    120
10   10002155 33685454          24      105        1.1     138       4.6    139
# ℹ 68,457 more rows
# ℹ 2 more variables: hematocrit <dbl>, wbc <dbl>

Q6. Vitals from charted events

chartevents.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/chartevents/) contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The itemid variable indicates a single measurement type in the database. The value variable is the value measured for itemid. The first 10 lines of chartevents.csv.gz are

zcat < ~/mimic/icu/chartevents.csv.gz | head
subject_id,hadm_id,stay_id,caregiver_id,charttime,storetime,itemid,value,valuenum,valueuom,warning
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220179,82,82,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220180,59,59,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220181,63,63,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220045,94,94,bpm,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220179,85,85,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220180,55,55,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220181,62,62,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220210,20,20,insp/min,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220277,95,95,%,0

d_items.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/d_items/) is the dictionary for the itemid in chartevents.csv.gz.

zcat < ~/mimic/icu/d_items.csv.gz | head
itemid,label,abbreviation,linksto,category,unitname,param_type,lownormalvalue,highnormalvalue
220001,Problem List,Problem List,chartevents,General,,Text,,
220003,ICU Admission date,ICU Admission date,datetimeevents,ADT,,Date and time,,
220045,Heart Rate,HR,chartevents,Routine Vital Signs,bpm,Numeric,,
220046,Heart rate Alarm - High,HR Alarm - High,chartevents,Alarms,bpm,Numeric,,
220047,Heart Rate Alarm - Low,HR Alarm - Low,chartevents,Alarms,bpm,Numeric,,
220048,Heart Rhythm,Heart Rhythm,chartevents,Routine Vital Signs,,Text,,
220050,Arterial Blood Pressure systolic,ABPs,chartevents,Routine Vital Signs,mmHg,Numeric,90,140
220051,Arterial Blood Pressure diastolic,ABPd,chartevents,Routine Vital Signs,mmHg,Numeric,60,90
220052,Arterial Blood Pressure mean,ABPm,chartevents,Routine Vital Signs,mmHg,Numeric,,

We are interested in the vitals for ICU patients: heart rate (220045), systolic non-invasive blood pressure (220179), diastolic non-invasive blood pressure (220180), body temperature in Fahrenheit (223761), and respiratory rate (220210). Retrieve a subset of chartevents.csv.gz only containing these items for the patients in icustays_tble. Further restrict to the first vital measurement within the ICU stay. The final chartevents_tble should have one row per ICU stay and columns for each vital measurement.

Hint: Use the Parquet format you generated in Homework 2. For reproducibility, make chartevents_pq folder available at the current working directory, for example, by a symbolic link.

Answer:

First, make a symbolic link for chartevents_pq folder at the current working directory to the Parquet format I generated in Homework2.

ln -s ~/203b-hw/hw2/chartevents.parquet chartevents_pq
chartevents_parquet_dataset <- open_dataset("chartevents_pq/chartevents.parquet") |>
  print(width = Inf)
FileSystemDataset with 1 Parquet file
subject_id: int64
hadm_id: int64
stay_id: int64
caregiver_id: int64
charttime: timestamp[ms]
storetime: timestamp[ms]
itemid: int64
value: string
valuenum: double
valueuom: string
warning: int64
chartevents_filtered_parquet <- chartevents_parquet_dataset %>%
  mutate(subject_id = as.integer(subject_id)) %>%
  mutate(itemid = as.double(itemid)) %>%
  select(subject_id, itemid, charttime, valuenum) %>%
  filter(itemid %in%
    c(220045, 220179, 220180, 223761, 220210)) %>%
  arrange(subject_id, charttime) %>%
  collect()
chartevents_filtered_parquet <- as_tibble(chartevents_filtered_parquet) |>
  print(width = Inf)
# A tibble: 22,504,119 × 4
   subject_id itemid charttime           valuenum
        <int>  <dbl> <dttm>                 <dbl>
 1   10000032 223761 2180-07-23 07:00:00     98.7
 2   10000032 220179 2180-07-23 07:11:00     84  
 3   10000032 220180 2180-07-23 07:11:00     48  
 4   10000032 220045 2180-07-23 07:12:00     91  
 5   10000032 220210 2180-07-23 07:12:00     24  
 6   10000032 220045 2180-07-23 07:30:00     93  
 7   10000032 220179 2180-07-23 07:30:00     95  
 8   10000032 220180 2180-07-23 07:30:00     59  
 9   10000032 220210 2180-07-23 07:30:00     21  
10   10000032 220045 2180-07-23 08:00:00     94  
# ℹ 22,504,109 more rows

Merge the stay_id with the dataset above:

chartevents_merge <- inner_join(
  chartevents_filtered_parquet,
  icustays_tble,
  by = "subject_id"
) |>
  filter(charttime >= intime & charttime <= outtime) |>
  group_by(subject_id, stay_id, itemid) %>%
  arrange(subject_id, stay_id, charttime) %>%
  slice_head(n = 1) |>
  ungroup()
Warning in inner_join(chartevents_filtered_parquet, icustays_tble, by = "subject_id"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 91 of `x` matches multiple rows in `y`.
ℹ Row 1 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
chartevents_d_item <- read_csv("~/mimic/icu/d_items.csv.gz",
  show_col_types = FALSE
) |>
  filter(itemid %in%
    c(220045, 220179, 220180, 223761, 220210)) |>
  select(itemid, label) |>
  print(width = Inf)
# A tibble: 5 × 2
  itemid label                                
   <dbl> <chr>                                
1 220045 Heart Rate                           
2 220179 Non Invasive Blood Pressure systolic 
3 220180 Non Invasive Blood Pressure diastolic
4 220210 Respiratory Rate                     
5 223761 Temperature Fahrenheit               
chartevents_d <- inner_join(
  chartevents_merge,
  chartevents_d_item,
  by = "itemid"
) |>
  collect() |>
  print(width = Inf)
# A tibble: 362,465 × 12
   subject_id itemid charttime           valuenum  hadm_id  stay_id
        <dbl>  <dbl> <dttm>                 <dbl>    <dbl>    <dbl>
 1   10000032 220045 2180-07-23 07:12:00     91   29079034 39553978
 2   10000032 220179 2180-07-23 07:11:00     84   29079034 39553978
 3   10000032 220180 2180-07-23 07:11:00     48   29079034 39553978
 4   10000032 220210 2180-07-23 07:12:00     24   29079034 39553978
 5   10000032 223761 2180-07-23 07:00:00     98.7 29079034 39553978
 6   10000980 220045 2189-06-27 01:56:00     77   26913865 39765666
 7   10000980 220179 2189-06-27 01:55:00    150   26913865 39765666
 8   10000980 220180 2189-06-27 01:55:00     77   26913865 39765666
 9   10000980 220210 2189-06-27 01:54:00     23   26913865 39765666
10   10000980 223761 2189-06-27 02:07:00     98   26913865 39765666
   first_careunit                     last_careunit                     
   <chr>                              <chr>                             
 1 Medical Intensive Care Unit (MICU) Medical Intensive Care Unit (MICU)
 2 Medical Intensive Care Unit (MICU) Medical Intensive Care Unit (MICU)
 3 Medical Intensive Care Unit (MICU) Medical Intensive Care Unit (MICU)
 4 Medical Intensive Care Unit (MICU) Medical Intensive Care Unit (MICU)
 5 Medical Intensive Care Unit (MICU) Medical Intensive Care Unit (MICU)
 6 Medical Intensive Care Unit (MICU) Medical Intensive Care Unit (MICU)
 7 Medical Intensive Care Unit (MICU) Medical Intensive Care Unit (MICU)
 8 Medical Intensive Care Unit (MICU) Medical Intensive Care Unit (MICU)
 9 Medical Intensive Care Unit (MICU) Medical Intensive Care Unit (MICU)
10 Medical Intensive Care Unit (MICU) Medical Intensive Care Unit (MICU)
   intime              outtime               los
   <dttm>              <dttm>              <dbl>
 1 2180-07-23 14:00:00 2180-07-23 23:50:47 0.410
 2 2180-07-23 14:00:00 2180-07-23 23:50:47 0.410
 3 2180-07-23 14:00:00 2180-07-23 23:50:47 0.410
 4 2180-07-23 14:00:00 2180-07-23 23:50:47 0.410
 5 2180-07-23 14:00:00 2180-07-23 23:50:47 0.410
 6 2189-06-27 08:42:00 2189-06-27 20:38:27 0.498
 7 2189-06-27 08:42:00 2189-06-27 20:38:27 0.498
 8 2189-06-27 08:42:00 2189-06-27 20:38:27 0.498
 9 2189-06-27 08:42:00 2189-06-27 20:38:27 0.498
10 2189-06-27 08:42:00 2189-06-27 20:38:27 0.498
   label                                
   <chr>                                
 1 Heart Rate                           
 2 Non Invasive Blood Pressure systolic 
 3 Non Invasive Blood Pressure diastolic
 4 Respiratory Rate                     
 5 Temperature Fahrenheit               
 6 Heart Rate                           
 7 Non Invasive Blood Pressure systolic 
 8 Non Invasive Blood Pressure diastolic
 9 Respiratory Rate                     
10 Temperature Fahrenheit               
# ℹ 362,455 more rows
chartevents_tble <- chartevents_d %>%
  pivot_wider(
    id_cols = c(subject_id, stay_id),
    names_from = label,
    values_from = valuenum,
    values_fn = list(valuenum = ~ .[1])
  ) %>%
  # rename_with(~str_to_lower(.), everything())
  rename(
    "heart rate" = "Heart Rate",
    "non_invasive_blood_pressure_systolic" = "Non Invasive Blood Pressure systolic",
    "non_invasive_blood_pressure_diastolic" = "Non Invasive Blood Pressure diastolic",
    "temperature_fahrenheit" = "Temperature Fahrenheit",
    "respiratory rate" = "Respiratory Rate"
  )
chartevents_tble
# A tibble: 73,164 × 7
   subject_id stay_id `heart rate` non_invasive_blood_p…¹ non_invasive_blood_p…²
        <dbl>   <dbl>        <dbl>                  <dbl>                  <dbl>
 1   10000032  3.96e7           91                     84                     48
 2   10000980  3.98e7           77                    150                     77
 3   10001217  3.46e7           96                    167                     95
 4   10001217  3.71e7           86                    151                     90
 5   10001725  3.12e7           55                     73                     56
 6   10001884  3.75e7           38                    180                     12
 7   10002013  3.91e7           80                    104                     70
 8   10002155  3.11e7           94                    118                     51
 9   10002155  3.24e7           98                    109                     65
10   10002155  3.37e7           68                    126                     61
# ℹ 73,154 more rows
# ℹ abbreviated names: ¹​non_invasive_blood_pressure_systolic,
#   ²​non_invasive_blood_pressure_diastolic
# ℹ 2 more variables: `respiratory rate` <dbl>, temperature_fahrenheit <dbl>

Q7. Putting things together

Let us create a tibble mimic_icu_cohort for all ICU stays, where rows are all ICU stays of adults (age at intime >= 18) and columns contain at least following variables

  • all variables in icustays_tble
  • all variables in admissions_tble
  • all variables in patients_tble
  • the last lab measurements before the ICU stay in labevents_tble
  • the first vital measurements during the ICU stay in chartevents_tble

The final mimic_icu_cohort should have one row per ICU stay and columns for each variable.

Answer:

mimic_icu_cohort <- icustays_tble %>%
  inner_join(
    admissions_tble,
    by = c("subject_id", "hadm_id")
  ) |>
  inner_join(
    patients_tble,
    by = "subject_id"
  ) |>
  left_join(
    labevents_tble,
    by = "stay_id"
  ) |>
  left_join(
    chartevents_tble,
    by = "stay_id"
  ) |>
  select(-c(subject_id, subject_id.y)) |>
  mutate(age_intime = year(intime) - anchor_year + anchor_age) |>
  filter(age_intime >= 18) |>
  rename("subject_id" = "subject_id.x") |>
  print(width = Inf)
# A tibble: 73,181 × 41
   subject_id  hadm_id  stay_id first_careunit                                  
        <dbl>    <dbl>    <dbl> <chr>                                           
 1   10000032 29079034 39553978 Medical Intensive Care Unit (MICU)              
 2   10000980 26913865 39765666 Medical Intensive Care Unit (MICU)              
 3   10001217 27703517 34592300 Surgical Intensive Care Unit (SICU)             
 4   10001217 24597018 37067082 Surgical Intensive Care Unit (SICU)             
 5   10001725 25563031 31205490 Medical/Surgical Intensive Care Unit (MICU/SICU)
 6   10001884 26184834 37510196 Medical Intensive Care Unit (MICU)              
 7   10002013 23581541 39060235 Cardiac Vascular Intensive Care Unit (CVICU)    
 8   10002155 28994087 31090461 Medical/Surgical Intensive Care Unit (MICU/SICU)
 9   10002155 20345487 32358465 Medical Intensive Care Unit (MICU)              
10   10002155 23822395 33685454 Coronary Care Unit (CCU)                        
   last_careunit                                    intime             
   <chr>                                            <dttm>             
 1 Medical Intensive Care Unit (MICU)               2180-07-23 14:00:00
 2 Medical Intensive Care Unit (MICU)               2189-06-27 08:42:00
 3 Surgical Intensive Care Unit (SICU)              2157-12-19 15:42:24
 4 Surgical Intensive Care Unit (SICU)              2157-11-20 19:18:02
 5 Medical/Surgical Intensive Care Unit (MICU/SICU) 2110-04-11 15:52:22
 6 Medical Intensive Care Unit (MICU)               2131-01-11 04:20:05
 7 Cardiac Vascular Intensive Care Unit (CVICU)     2160-05-18 10:00:53
 8 Medical/Surgical Intensive Care Unit (MICU/SICU) 2130-09-24 00:50:00
 9 Medical Intensive Care Unit (MICU)               2131-03-09 21:33:00
10 Coronary Care Unit (CCU)                         2129-08-04 12:45:00
   outtime               los admittime           dischtime          
   <dttm>              <dbl> <dttm>              <dttm>             
 1 2180-07-23 23:50:47 0.410 2180-07-23 12:35:00 2180-07-25 17:55:00
 2 2189-06-27 20:38:27 0.498 2189-06-27 07:38:00 2189-07-03 03:00:00
 3 2157-12-20 14:27:41 0.948 2157-12-18 16:58:00 2157-12-24 14:55:00
 4 2157-11-21 22:08:00 1.12  2157-11-18 22:56:00 2157-11-25 18:00:00
 5 2110-04-12 23:59:56 1.34  2110-04-11 15:08:00 2110-04-14 15:00:00
 6 2131-01-20 08:27:30 9.17  2131-01-07 20:39:00 2131-01-20 05:15:00
 7 2160-05-19 17:33:33 1.31  2160-05-18 07:45:00 2160-05-23 13:30:00
 8 2130-09-27 22:13:41 3.89  2130-09-23 21:59:00 2130-09-29 18:55:00
 9 2131-03-10 18:09:21 0.859 2131-03-09 20:33:00 2131-03-10 01:55:00
10 2129-08-10 17:02:38 6.18  2129-08-04 12:44:00 2129-08-18 16:53:00
   deathtime           admission_type              admit_provider_id
   <dttm>              <chr>                       <chr>            
 1 NA                  EW EMER.                    P30KEH           
 2 NA                  EW EMER.                    P30KEH           
 3 NA                  DIRECT EMER.                P99698           
 4 NA                  EW EMER.                    P4645A           
 5 NA                  EW EMER.                    P35SU0           
 6 2131-01-20 05:15:00 OBSERVATION ADMIT           P874LG           
 7 NA                  SURGICAL SAME DAY ADMISSION P47E1G           
 8 NA                  EW EMER.                    P3529J           
 9 2131-03-10 21:53:00 EW EMER.                    P80515           
10 NA                  EW EMER.                    P05HUO           
   admission_location discharge_location           insurance language
   <chr>              <chr>                        <chr>     <chr>   
 1 EMERGENCY ROOM     HOME                         Medicaid  ENGLISH 
 2 EMERGENCY ROOM     HOME HEALTH CARE             Medicare  ENGLISH 
 3 PHYSICIAN REFERRAL HOME HEALTH CARE             Other     ?       
 4 EMERGENCY ROOM     HOME HEALTH CARE             Other     ?       
 5 PACU               HOME                         Other     ENGLISH 
 6 EMERGENCY ROOM     DIED                         Medicare  ENGLISH 
 7 PHYSICIAN REFERRAL HOME HEALTH CARE             Medicare  ENGLISH 
 8 EMERGENCY ROOM     HOME HEALTH CARE             Other     ENGLISH 
 9 EMERGENCY ROOM     DIED                         Other     ENGLISH 
10 PROCEDURE SITE     CHRONIC/LONG TERM ACUTE CARE Other     ENGLISH 
   marital_status race                   edregtime           edouttime          
   <chr>          <chr>                  <dttm>              <dttm>             
 1 WIDOWED        WHITE                  2180-07-23 05:54:00 2180-07-23 14:00:00
 2 MARRIED        BLACK/AFRICAN AMERICAN 2189-06-27 06:25:00 2189-06-27 08:42:00
 3 MARRIED        WHITE                  NA                  NA                 
 4 MARRIED        WHITE                  2157-11-18 17:38:00 2157-11-19 01:24:00
 5 MARRIED        WHITE                  NA                  NA                 
 6 MARRIED        BLACK/AFRICAN AMERICAN 2131-01-07 13:36:00 2131-01-07 22:13:00
 7 SINGLE         OTHER                  NA                  NA                 
 8 MARRIED        WHITE                  2130-09-23 19:59:00 2130-09-24 00:50:00
 9 MARRIED        WHITE                  2131-03-09 19:14:00 2131-03-09 21:33:00
10 MARRIED        WHITE                  2129-08-04 11:00:00 2129-08-04 12:35:00
   hospital_expire_flag gender anchor_age anchor_year anchor_year_group
                  <dbl> <chr>       <dbl>       <dbl> <chr>            
 1                    0 F              52        2180 2014 - 2016      
 2                    0 F              73        2186 2008 - 2010      
 3                    0 F              55        2157 2011 - 2013      
 4                    0 F              55        2157 2011 - 2013      
 5                    0 F              46        2110 2011 - 2013      
 6                    1 F              68        2122 2008 - 2010      
 7                    0 F              53        2156 2008 - 2010      
 8                    0 F              80        2128 2008 - 2010      
 9                    1 F              80        2128 2008 - 2010      
10                    0 F              80        2128 2008 - 2010      
   dod        bicarbonate chloride creatinine glucose potassium sodium
   <date>           <dbl>    <dbl>      <dbl>   <dbl>     <dbl>  <dbl>
 1 2180-09-09          25       95        0.7     102       6.7    126
 2 2193-08-26          21      109        2.3      89       3.9    144
 3 NA                  30      104        0.5      87       4.1    142
 4 NA                  22      108        0.6     112       4.2    142
 5 NA                  NA       98       NA        NA       4.1    139
 6 2131-01-20          30       88        1.1     141       4.5    130
 7 NA                  24      102        0.9     288       3.5    137
 8 2131-03-10          23       98        2.8     117       4.9    135
 9 2131-03-10          26       85        1.4     133       5.7    120
10 2131-03-10          24      105        1.1     138       4.6    139
   hematocrit   wbc `heart rate` non_invasive_blood_pressure_systolic
        <dbl> <dbl>        <dbl>                                <dbl>
 1       41.1   6.9           91                                   84
 2       27.3   5.3           77                                  150
 3       37.4   5.4           96                                  167
 4       38.1  15.7           86                                  151
 5       NA    NA             55                                   73
 6       39.7  12.2           38                                  180
 7       34.9   7.2           80                                  104
 8       25.5  17.9           94                                  118
 9       22.4   9.8           98                                  109
10       39.7   7.9           68                                  126
   non_invasive_blood_pressure_diastolic `respiratory rate`
                                   <dbl>              <dbl>
 1                                    48                 24
 2                                    77                 23
 3                                    95                 11
 4                                    90                 18
 5                                    56                 19
 6                                    12                 10
 7                                    70                 14
 8                                    51                 18
 9                                    65                 23
10                                    61                 18
   temperature_fahrenheit age_intime
                    <dbl>      <dbl>
 1                   98.7         52
 2                   98           76
 3                   97.6         55
 4                   98.5         55
 5                   97.7         46
 6                   98.1         77
 7                   97.2         57
 8                   96.9         82
 9                   97.7         83
10                   95.9         81
# ℹ 73,171 more rows

Check whether the age_intime is greater than 18:

table(mimic_icu_cohort$age_intime)

  18   19   20   21   22   23   24   25   26   27   28   29   30   31   32   33 
  84  132  246  241  252  272  234  276  314  283  348  328  325  308  391  357 
  34   35   36   37   38   39   40   41   42   43   44   45   46   47   48   49 
 374  382  364  366  461  435  431  412  519  514  593  571  688  736  782  791 
  50   51   52   53   54   55   56   57   58   59   60   61   62   63   64   65 
 948  976 1134 1224 1276 1313 1408 1318 1399 1481 1570 1602 1622 1629 1756 1695 
  66   67   68   69   70   71   72   73   74   75   76   77   78   79   80   81 
1744 1887 1805 1695 1697 1704 1614 1588 1518 1568 1468 1510 1449 1363 1362 1370 
  82   83   84   85   86   87   88   89   90   91   92   93   94   95   96   97 
1407 1284 1305 1186 1122 1028  931  724  336 2201  468  240  194   99   54   55 
  98   99  100  102 
  24   13    6    1 

Q8. Exploratory data analysis (EDA)

Summarize the following information about the ICU stay cohort mimic_icu_cohort using appropriate numerics or graphs:

  • Length of ICU stay los vs demographic variables (race, insurance, marital_status, gender, age at intime)

Answer:

Race:

Get rid of ones that are “UNABLE TO OBTAIN” and “UNKNOWN” and “PATIENT DECLINED TO ANSWER” and “OTHER”:

mimic_icu_cohort %>%
  filter(
    race != "UNABLE TO OBTAIN" &
      race != "UNKNOWN" &
      race != "PATIENT DECLINED TO ANSWER" &
      race != "OTHER"
  ) %>%
  ggplot(aes(x = los, fill = race)) +
  geom_histogram(binwidth = 30, position = "dodge", alpha = 0.7) +
  labs(
    x = "Length of ICU Stay",
    y = "Frequency",
    title = "Histogram of ICU Stay Length by Race"
  ) +
  theme(
    legend.position = "bottom",
    legend.box = "vertical"
  )

# Numerical summaries
mimic_icu_cohort %>%
  group_by(race) %>%
  summarise(
    Mean_LOS = mean(los, na.rm = TRUE),
    Median_LOS = median(los, na.rm = TRUE),
    SD_LOS = sd(los, na.rm = TRUE)
  ) %>%
  arrange(desc(Mean_LOS)) %>%
  print(width = Inf)
# A tibble: 33 × 4
   race                          Mean_LOS Median_LOS SD_LOS
   <chr>                            <dbl>      <dbl>  <dbl>
 1 AMERICAN INDIAN/ALASKA NATIVE     4.46       2.09   6.54
 2 PORTUGUESE                        4.26       2.14   6.84
 3 ASIAN - KOREAN                    4.23       2.08   6.35
 4 UNABLE TO OBTAIN                  4.22       2.09   6.29
 5 UNKNOWN                           4.20       2.18   5.66
 6 ASIAN - ASIAN INDIAN              4.20       1.95   7.61
 7 HISPANIC/LATINO - DOMINICAN       3.86       2.09   5.54
 8 BLACK/AFRICAN                     3.82       2.02   5.90
 9 WHITE - OTHER EUROPEAN            3.66       1.91   6.43
10 BLACK/CARIBBEAN ISLAND            3.61       1.95   5.13
# ℹ 23 more rows

After exploring both graphics and numerical summaries, it is evident that numerical summaries provide more informative insights. The highest mean length of stay is observed among American Indian/Alaska Native individuals, while the lowest mean length of stay is observed among Hispanic/Latino - Mexican individuals.

insurance

Check the categories of insurance types:

table(mimic_icu_cohort$insurance)

Medicaid Medicare    Other 
    5528    33091    34562 
mimic_icu_cohort %>%
  ggplot(aes(x = los, fill = insurance)) +
  geom_histogram(bins = 30, position = "dodge") +
  labs(
    x = "Length of ICU Stay",
    y = "Frequency",
    title = "Histogram of ICU Stay Length by Insurance"
  ) +
  theme(
    legend.position = "bottom",
    legend.box = "vertical"
  )

The histogram indicates a similar distribution of ICU stay length across all insurance types. For each type, the trend shows a decrease as the length of ICU stay increases.

marital_status

Check the categories of marital status:

table(mimic_icu_cohort$marital_status)

DIVORCED  MARRIED   SINGLE  WIDOWED 
    5404    32768    20858     9038 
mimic_icu_cohort %>%
  ggplot(aes(x = los, fill = marital_status, na.rm = TRUE)) +
  geom_histogram(bins = 30, position = "dodge") +
  labs(
    x = "Length of ICU Stay",
    y = "Frequency",
    title = "Histogram of ICU Stay Length by Marital Status"
  ) +
  theme(
    legend.position = "bottom",
    legend.box = "vertical"
  )

Based on the histogram, the distribution of ICU stay length is similar across all marital statuses. For each status, the trend shows a decrease as the length of ICU stay increases.

gender

Check the categories:

table(mimic_icu_cohort$gender)

    F     M 
32363 40818 
mimic_icu_cohort %>%
  ggplot(aes(x = los, fill = gender)) +
  geom_histogram(bins = 30, position = "dodge") +
  labs(
    x = "Length of ICU Stay",
    y = "Frequency",
    title = "Histogram of ICU Stay Length by Gender"
  ) +
  theme(
    legend.position = "bottom",
    legend.box = "vertical"
  )

The histogram indicates a similar distribution of ICU stay length across males and females. For each gender, the trend shows a decrease as the length of ICU stay increases.

age at intime

mimic_icu_cohort %>%
  ggplot(aes(x = age_intime, y = los)) +
  geom_point(aes(x = age_intime, y = los), alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(
    x = "Age at Intime",
    y = "Length of ICU Stay",
    title = "Scatterplot of ICU Stay Length vs Age at Intime"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

Therefore, the scatterplot indicates that there is no relationship between age at intime and length of ICU stay. The linear regression line is almost flat, indicating no relationship between the two variables.

mimic_icu_cohort %>%
  group_by(age_intime) %>%
  summarise(
    Mean_LOS = mean(los, na.rm = TRUE),
    Median_LOS = median(los, na.rm = TRUE),
    SD_LOS = sd(los, na.rm = TRUE)
  ) %>%
  arrange(desc(Mean_LOS)) %>%
  print(width = Inf)
# A tibble: 84 × 4
   age_intime Mean_LOS Median_LOS SD_LOS
        <dbl>    <dbl>      <dbl>  <dbl>
 1         27     4.61       1.91   8.95
 2         47     3.99       2.02   6.83
 3         44     3.89       1.88   6.34
 4         28     3.76       1.69   6.88
 5         58     3.75       1.95   5.92
 6         41     3.73       1.94   4.96
 7         70     3.73       1.98   6.35
 8         75     3.72       2.02   5.28
 9         77     3.71       2.17   4.86
10         80     3.71       2.15   4.55
# ℹ 74 more rows

Based on the numerical summaries, the highest mean length of stay is observed among individuals aged 27, while the lowest mean length of stay is observed among individuals aged 102.

  • Length of ICU stay los vs the last available lab measurements before ICU stay

Answer:

lab_measurements <- c(
  "creatinine",
  "potassium",
  "sodium",
  "chloride",
  "bicarbonate",
  "hematocrit",
  "wbc",
  "glucose"
)
library(purrr)
walk(lab_measurements, ~ {
  plot_data <- mimic_icu_cohort %>%
    ggplot(aes(x = !!sym(.x), y = los)) +
    geom_point(alpha = 0.7) +
    geom_smooth(method = "lm", se = FALSE, color = "blue") +
    labs(
      x = .x,
      y = "Length of ICU Stay",
      title = str_c("Scatterplot of ICU Stay Length vs ", .x)
    ) +
    theme_minimal()

  print(plot_data)
})
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 5770 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 5770 rows containing missing values (`geom_point()`).

`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 8901 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 8901 rows containing missing values (`geom_point()`).

`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 8872 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 8872 rows containing missing values (`geom_point()`).

`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 8883 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 8883 rows containing missing values (`geom_point()`).

`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 9050 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 9050 rows containing missing values (`geom_point()`).

`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 5017 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 5017 rows containing missing values (`geom_point()`).

`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 5094 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 5094 rows containing missing values (`geom_point()`).

`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 9099 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 9099 rows containing missing values (`geom_point()`).

Based on the scatterplots, there is slightly positive relationship between the last available lab measurements before ICU stay —— creatinine, wbc and the length of ICU stay, with the linear regression line showing a slight positive slope, which means that the length of ICU stay increases as the value of creatinine and wbc increases. potassium, sodium, bicarbonate, and glucose show no relationship with the length of ICU stay, as the linear regression lines are almost flat. chloride and hematocrit show a slightly negative relationship with the length of ICU stay, as the linear regression lines show a slight negative slope, which means that the length of ICU stay decreases as the value of chloride and hematocrit increases.

  • Length of ICU stay los vs the first vital measurements within the ICU stay

Answer:

vital_measurements <- c(
  "heart rate",
  "non_invasive_blood_pressure_systolic",
  "non_invasive_blood_pressure_diastolic",
  "temperature_fahrenheit",
  "respiratory rate"
)
walk(vital_measurements, ~ {
  plot_data <- mimic_icu_cohort %>%
    ggplot(aes(x = !!sym(.x), y = los)) +
    geom_point(alpha = 0.7) +
    geom_smooth(method = "lm", se = FALSE, color = "blue") +
    labs(
      x = .x,
      y = "Length of ICU Stay",
      title = str_c("Scatterplot of ICU Stay Length vs ", .x)
    ) +
    theme_minimal()

  print(plot_data)
})
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 18 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 18 rows containing missing values (`geom_point()`).

`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 979 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 979 rows containing missing values (`geom_point()`).

`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 983 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 983 rows containing missing values (`geom_point()`).

`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 1362 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 1362 rows containing missing values (`geom_point()`).

`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 98 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 98 rows containing missing values (`geom_point()`).

Based on the scatterplots, there is no relationship between the first vital measurements within the ICU stay – none invasive blood pressure diastolic and the length of ICU stay. The linear regression line is almost flat, indicating no relationship between the two variables. heart rate, temperature and respiratory rate show a slightly positive relationship with the length of ICU stay, as the linear regression lines show a slight positive slope, which means that the length of ICU stay increases as the value of heart rate, temperature and respiratory rate increases. non invasive blood pressure systolic shows a slightly negative relationship with the length of ICU stay, as the linear regression line shows a slight negative slope, which means that the length of ICU stay decreases as the value of non invasive blood pressure systolic increases.

  • Length of ICU stay los vs first ICU unit

Answer:

mimic_icu_cohort %>%
  ggplot(aes(x = first_careunit, y = los)) +
  geom_boxplot() +
  labs(
    x = "First ICU Unit",
    y = "Length of ICU Stay",
    title = "Boxplot of ICU Stay Length by First ICU Unit"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1, size = 6))

Based on the boxplot, the distribution of ICU stay length is similar across all first ICU units. The average of ICU stay length is similar across all first ICU units, with the highest average observed in the Neuro SICU.

The boxplot is not very informative, so I will use numerical summary, which again shows that the average length of stay is similar across all first ICU units. The highest average length of stay is observed in the Neuro SICU, while the lowest average length of stay is observed in the Cardiac SICU.

mimic_icu_cohort %>%
  group_by(first_careunit) %>%
  summarise(
    Mean_LOS = mean(los, na.rm = TRUE),
    Median_LOS = median(los, na.rm = TRUE),
    SD_LOS = sd(los, na.rm = TRUE)
  ) %>%
  arrange(desc(Mean_LOS)) %>%
  print(width = Inf)
# A tibble: 9 × 4
  first_careunit                                   Mean_LOS Median_LOS SD_LOS
  <chr>                                               <dbl>      <dbl>  <dbl>
1 Neuro Surgical Intensive Care Unit (Neuro SICU)      6.30       3.65   7.46
2 Surgical Intensive Care Unit (SICU)                  3.84       1.97   5.76
3 Trauma SICU (TSICU)                                  3.83       1.94   5.52
4 Neuro Intermediate                                   3.42       2.34   3.73
5 Cardiac Vascular Intensive Care Unit (CVICU)         3.29       1.99   4.89
6 Medical Intensive Care Unit (MICU)                   3.26       1.83   4.61
7 Coronary Care Unit (CCU)                             3.19       2.01   4.00
8 Medical/Surgical Intensive Care Unit (MICU/SICU)     3.08       1.81   4.30
9 Neuro Stepdown                                       2.59       1.69   2.89